Power Users Can Use Redis to Filter Already Downloaded URLs, Use HashSet for URL Matching From File #1914
Conversation
(Force-pushed from 398ee03 to ea8ba4b.)
What a cool pull request. Not that I'd ever need it, but the principle is a great showcase :) I tried to merge it here: https://github.com/ripmeapp2/ripme , but then wondered how I could see within a couple of seconds, now and in the future, whether it works. Would you mind adding a tiny unit test, maybe along the lines of:
@soloturn I've added some tests for this. Please let me know if you want me to change anything, especially regarding style. I'm both new to this codebase and the Java world in general!
It seems to work; the only "downside" is that the project seems to be abandoned.
Thank you @SelfOnTheShelf! Three tiny things, if you could adjust please:
Category
This change is exactly one of the following (please change `[ ]` to `[x]` to indicate which):

Description
This feature adds support for using Redis as the mechanism for skipping already downloaded URLs. If you use RipMe over a long period of time to download many, many galleries and albums, the url_history.txt file gets quite large, and doing an O(n) scan through the entire list for every URL in a job becomes VERY expensive. My own url_history.txt file is approaching 3 million lines and 130 MB. Using Redis speeds up the ripping process considerably AND lets power users coordinate jobs running across multiple machines on a network.
Users can optionally add the following lines to the rip.properties file:
If users do not add this configuration, the URL matching algorithm now uses a HashSet instead. This is more memory-intensive, but it performs much faster than the sequential scan.
Note: RipMe will continue to append new lines to the url_history.txt file, since this operation does not seem to slow down the job (at least at the scales that I have encountered).
Note 2: The easiest way to run Redis locally is to use Docker (something like `docker run --name my-redis -d -p 6379:6379 redis`). Alternatively, you could download and install Redis for your OS.
Testing

Required verification:
- `mvn test` (there are no new failures or errors).

Optional but recommended: